Random Effects Models and Panel Data

Autocorrelation

Possible fixes (that don’t require throwing out data)

  • Clustered Standard Errors: use the bootstrap or a sandwich estimator to adjust the standard errors upward

  • Fixed effects: make a dummy for each group and include it in the model, (potentially) eliminating the autocorrelation while also controlling for unobserved differences across groups

  • Random effects

“Random” vs. Fixed

(the terminology here is inconsistent and contested, but no one has come up with better words yet)

  • Fixed Effects Models give every group its own y-intercept. The effects of “Texas” vs. “Tennessee” are treated as fixed parameters, unrelated to one another.

    • In this sense, you’re usually assuming a fixed effect any time you run a standard OLS model.
  • Random Effects Models assume that group intercepts are random draws from a common distribution. They can vary just like any other random variable, but they vary in a predictable way.

An example: students in classrooms

  • You want to estimate the effect of time spent studying on student achievement

  • But you also want to account for differences between teachers: you suspect some are slightly better than others.

flowchart LR
Z[School] --> A
Z --> G
Z --> L
A[Classroom 1] --> B(Student A)
A --> C(Student B)
A --> E(Student D)
G[Classroom 2] --> H(Student H)
G --> I(Student I)
G --> K(Student K)
L[Classroom 3] --> M(Student M)

An example: students in classrooms

You calculate some averages and your results look like this:

Instructor   Number of students   Average SAT score
Teacher 1    50                   1029
Teacher 2    25                   1020
Teacher 3    5                    1120

Under a fixed effects assumption, each teacher’s effect is a fixed parameter, and knowing the mean score for teacher 1 and 2 doesn’t tell you anything about what the mean score should be for teacher 3.

Under a random effects assumption, the teacher effect is drawn from a larger distribution that has a central tendency and a variance. So you might look at the value for teacher 3 and conclude: “this seems implausible; in reality, this effect should probably be closer to the effect for the other two.”

An example: students in classrooms

So we could assume something like this:

\[ SAT = \text{teacher effect} + b_\text{studying} \times \text{study time} + \epsilon \]

\[ \epsilon \sim \mathcal{N}(0, \sigma) \]

But we could instead assume a model in which the “teacher effect” is itself a random variable with a normal distribution, just like the error term:

\[ SAT = (\text{teacher effect}) + b_\text{studying} \times \text{study time} + \epsilon \]

\[ \text{teacher effect} \sim \mathcal{N}(0, \tau) \]

In the second case, we estimate a model that shrinks each teacher’s effect slightly towards the grand mean.

Random Effects: pooling

One way to think about a random effect is that it represents a compromise between “no pooling” and “complete pooling” of regression results.

  • “No pooling” would mean estimating a totally separate model for each teacher

  • “Complete pooling” would mean estimating a model that totally ignores teacher effects.

  • “Partial pooling” would mean estimating a model that uses a weighted average of these cases
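As a quick illustration, the two extremes can be fit with plain lm(); everything in the sketch below (group sizes, coefficients, noise) is invented:

```r
# Sketch of the two pooling extremes on simulated data (all values invented).
set.seed(42)
dat <- data.frame(
  group = rep(letters[1:5], times = c(50, 40, 30, 10, 3)),  # unequal groups
  x     = rnorm(133)
)
# each group gets its own intercept shift, plus noise
dat$y <- 2 + 0.5 * dat$x + rnorm(5)[as.integer(factor(dat$group))] + rnorm(133)

# "Complete pooling": one model, groups ignored
pooled <- lm(y ~ x, data = dat)

# "No pooling": a totally separate model per group
by_group <- lapply(split(dat, dat$group), function(d) lm(y ~ x, data = d))

# "Partial pooling" sits in between, e.g. lme4::lmer(y ~ x + (1 | group))
coef(pooled)["x"]
sapply(by_group, function(m) coef(m)["x"])  # noisy for the tiny groups
```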

Random Effects: pooling

A partial pooling estimate favors the “no pooling” estimate if group \(j\) is large and low-variance, and favors the “complete pooling” estimate if group \(j\) is small or high-variance.

\[ \text{Estimate of random intercept } \alpha_j \approx \frac{\frac{n_j}{\sigma^2_y}}{\frac{n_j}{\sigma^2_y} + \frac{1}{\sigma^2_a}} (\bar{y}_j - \beta\bar{x}_j) + \frac{\frac{1}{\sigma^2_a}}{\frac{n_j}{\sigma^2_y} + \frac{1}{\sigma^2_a}} \mu_a \]
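Plugging some invented variance components into the teacher example (and ignoring the covariate term, so each group's mean stands in for \(\bar{y}_j - \beta\bar{x}_j\)) shows how the weights behave; sigma_y, sigma_a, and mu_a below are assumptions, not estimates:

```r
# Hedged numeric sketch of the partial-pooling weights.
# sigma_y, sigma_a, and mu_a are invented for illustration.
sigma_y <- 200   # assumed within-group SD of SAT scores
sigma_a <- 30    # assumed SD of the teacher effects
mu_a    <- 1029  # assumed grand mean

shrink <- function(n_j, ybar_j) {
  w <- (n_j / sigma_y^2) / (n_j / sigma_y^2 + 1 / sigma_a^2)
  w * ybar_j + (1 - w) * mu_a
}

shrink(25, 1020)  # teacher 2: mild shrinkage (~1026)
shrink(5, 1120)   # teacher 3: pulled most of the way back (~1038)
```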

Random Effects: pooling

  • Random effects models will often give more plausible estimates for cases with small sample sizes, because they “pool” extreme estimates from small samples toward the overall mean.

    • Think of it like estimating a batting average for a player with 2 at-bats. Is it plausible that they’re really batting 1.000? Or should we assume they’re around the league average until we get more information?

Random Effects

library(fixest)
library(modelsummary)  # for the comparison table below

# county-level election results
fmodel <- feols(perc_gop ~ Income + White + AfAm + Hisp | state_name,
                data = counties)
regmodel <- feols(perc_gop ~ Income + White + AfAm + Hisp,
                  data = counties, cluster = ~state_name)

mlist <- list("No fixed effects" = regmodel,
              "Fixed effects" = fmodel)
modelsummary(mlist,
             estimate  = "{estimate}",
             statistic = c("conf.int", "std.error"),
             conf_level = .95,
             coef_rename = coefnames,
             coef_omit = "Intercept",
             gof_map = c("nobs", "aic", "bic", "vcov.type",
                         "FE: state_name"))
                       No fixed effects    Fixed effects
Median income ($1000)  -0.374              -0.257
                       [-0.444, -0.304]    [-0.308, -0.206]
                       (0.035)             (0.025)
% White                0.574               0.592
                       [0.399, 0.748]      [0.525, 0.659]
                       (0.087)             (0.034)
% African American     -0.047              -0.249
                       [-0.247, 0.153]     [-0.362, -0.135]
                       (0.100)             (0.057)
% Hispanic             0.427               0.275
                       [0.181, 0.673]      [0.152, 0.398]
                       (0.122)             (0.061)
Num.Obs.               3115                3115
AIC                    24166.5             22175.2
BIC                    24196.7             22507.6
Std.Errors             by: state_name      by: state_name
FE: state_name                             X

Random Effects

To specify a random effect, we can use the lme4 package. The (1|state_name) says that I want to estimate random intercepts for each state:

library(lme4)
# fixed effects version for comparison:
fixefmodel <- fixest::feols(perc_gop ~ Income + White + AfAm + Hisp | state_name,
                            data = counties)

# random intercept for each state:
ranefmodel <- lmer(perc_gop ~ Income + White + AfAm + Hisp + (1|state_name),
                   data = counties)

Random Effects

                       No fixed effects    Fixed effects       Random effects
Median income ($1000)  -0.374              -0.257              -0.260
                       [-0.444, -0.304]    [-0.308, -0.206]    [-0.281, -0.239]
                       (0.035)             (0.025)             (0.011)
% White                0.574               0.592               0.588
                       [0.399, 0.748]      [0.525, 0.659]      [0.543, 0.633]
                       (0.087)             (0.034)             (0.023)
% African American     -0.047              -0.249              -0.249
                       [-0.247, 0.153]     [-0.362, -0.135]    [-0.305, -0.194]
                       (0.100)             (0.057)             (0.028)
% Hispanic             0.427               0.275               0.272
                       [0.181, 0.673]      [0.152, 0.398]      [0.219, 0.326]
                       (0.122)             (0.061)             (0.027)
SD (Observations)                                              8.430
Num.Obs.               3115                3115                3115
AIC                    24166.5             22175.2             22370.0
BIC                    24196.7             22507.6             22412.3
Std.Errors             by: state_name      by: state_name
FE: state_name                             X

Random Effects: additional covariates

With fixed effects models, we can’t include controls for group-level covariates, because anything that is constant within a group is perfectly collinear with the “fixed” effect:

library(rvest)
library(tidyverse)  # dplyr, stringr, and lubridate for the wrangling below

page <- read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_by_date_of_admission_to_the_Union')

# scrape each state's admission date and convert it to a year
join_date <- page |>
  html_element(css = 'table') |>
  html_table() |>
  select(2:4) |>
  mutate(datestring = str_extract(
    `Date(admitted or ratified)`, "([A-Z][a-z]+ [0-9]{1,2}, [0-9]{4})"),
    date = mdy(datestring))

counties <- counties |>
  left_join(join_date, by = join_by(state_name == State)) |>
  mutate(join_year = year(date) - 1787) |>  # years since 1787
  drop_na(join_year)

fixefmodel <- fixest::feols(perc_gop ~ Income + White + AfAm + Hisp + join_year | state_name,
                            data = counties)

fixefmodel
OLS estimation, Dep. Var.: perc_gop
Observations: 3,114
Fixed-effects: state_name: 50
Standard-errors: Clustered (state_name) 
        Estimate Std. Error   t value   Pr(>|t|)    
Income -0.257136   0.025502 -10.08291 1.5294e-13 ***
White   0.591932   0.033514  17.66219  < 2.2e-16 ***
AfAm   -0.248722   0.056598  -4.39450 5.9451e-05 ***
Hisp    0.274869   0.061121   4.49710 4.2332e-05 ***
... 1 variable was removed because of collinearity (join_year)
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 8.35604     Adj. R2: 0.713165
                Within R2: 0.567463

Random Effects: additional covariates

But random effects models can still estimate effects for group-level characteristics:

ranefmodel<-lmer(perc_gop ~ Income + White + AfAm + Hisp + join_year + (1|state_name),
               data=counties
               )

summary(ranefmodel)
Linear mixed model fit by REML ['lmerMod']
Formula: perc_gop ~ Income + White + AfAm + Hisp + join_year + (1 | state_name)
   Data: counties

REML criterion at convergence: 22348.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.1757 -0.4745  0.0931  0.5828  3.3831 

Random effects:
 Groups     Name        Variance Std.Dev.
 state_name (Intercept) 113.01   10.63   
 Residual                71.06    8.43   
Number of obs: 3114, groups:  state_name, 50

Fixed effects:
            Estimate Std. Error t value
(Intercept) 31.27089    3.36819   9.284
Income      -0.25919    0.01074 -24.124
White        0.59241    0.02327  25.463
AfAm        -0.24204    0.02853  -8.482
Hisp         0.27559    0.02739  10.061
join_year    0.06814    0.03294   2.068

Correlation of Fixed Effects:
          (Intr) Income White  AfAm   Hisp  
Income    -0.276                            
White     -0.693  0.072                     
AfAm      -0.630  0.154  0.830              
Hisp      -0.584  0.061  0.816  0.691       
join_year -0.576  0.018  0.095  0.105  0.058

Random Effects: random slopes

We can also estimate random slopes using this same basic approach.

ranefmodel <- lmer(perc_gop ~ White + AfAm + Hisp + join_year + (Income | state_name),
                   data = counties)

library(marginaleffects)
# predicted % GOP vote across incomes for a handful of states
nd <- datagrid(model = ranefmodel,
               state_name = c("Alabama", "Wyoming", "Maryland", "Virginia",
                              "Nevada", "North Carolina"),
               Income = seq(17, 170, by = 20))

plot_predictions(ranefmodel, newdata = nd, by = c('Income', 'state_name')) +
  theme_bw() +
  labs(color = 'State', fill = 'State', y = '% GOP vote')

What to use

  • Fixed Effects models are best when:
    • You want to control for un-modeled differences across groups
    • You think groups are fundamentally “different”
    • You don’t need to estimate effects for any group-level characteristics
  • Random Effects models are preferable when:
    • There are lots of small groups with very few members
    • You’re less worried about unmeasured heterogeneity between clusters
    • You want to include characteristics across different “levels”, or even allow the slopes to vary across groups

Time

  • OLS assumes that observations are independent, but your height today is not independent of your height yesterday. This violates the assumption of independent errors.
  • Moreover, time can be a confounding variable: if two things both trend over time, they’ll appear correlated in a regression that fails to account for the trend.
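A quick simulation (with invented numbers) makes the confounding point concrete: two series that share nothing but a time trend look strongly related until the trend is controlled for:

```r
# Two independent series that both trend over time (simulated).
set.seed(1)
t <- 1:100
a <- 0.10 * t + rnorm(100)  # trends upward
b <- 0.05 * t + rnorm(100)  # also trends upward, otherwise unrelated
cor(a, b)                   # large, driven entirely by the shared trend
coef(lm(a ~ b))["b"]        # "effect" of b on a in a naive regression
coef(lm(a ~ b + t))["b"]    # shrinks toward zero once the trend is controlled
```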

Problems with time

The democracy score of Paraguay in 2000 is a good predictor of current democracy scores in Paraguay.

vdem<-vdemdata::vdem|>
  filter(e_regionpol == "2")|>
  select(country_name, year, v2x_polyarchy, e_gdppc, e_pop)

vdem|>
  filter(country_name=='Paraguay')|>
ggplot(aes(x=year, y=v2x_polyarchy)) +
  geom_line()  +
  facet_wrap(~country_name) +
  theme_minimal()

Problems with time

Lack of independence poses similar problems to what we observed with clustered samples: we will generally overstate our level of confidence in estimates. We actually have very few truly independent observations here.

library(huxtable)  # provides huxreg()

model <- vdem |>
  filter(country_name == 'Paraguay') |>
  filter(year >= 1995) |>
  lm(v2x_polyarchy ~ log(e_gdppc), data = _)

huxreg(model)
                  (1)
(Intercept)       0.278 **
                  (0.086)
log(e_gdppc)      0.112 **
                  (0.032)
N                 25
R2                0.347
logLik            53.262
AIC               -100.524
*** p < 0.001; ** p < 0.01; * p < 0.05.

Problems with time

Time trends can also be a source of spuriousness. If any two variables have a time trend, they’ll appear to be related in a naive regression.

vdem|>
ggplot(aes(x=log(e_pop), y=v2x_polyarchy)) +
  geom_line()  +
  geom_smooth(method='lm', se=FALSE)+
  facet_wrap(~country_name) +
  theme_minimal() +
  xlab('logged population') +
  ylab('polyarchy score')

Potential fixes

  • Correcting the standard errors uses methods similar to the ones we’ve discussed elsewhere.

  • We could control for time linearly or with a polynomial or spline function.

  • We could treat time as a factor and include it as a fixed effect.

  • We could treat time as a random effect term.

Potential fixes: time coefficients

model1 <- lm(v2x_polyarchy ~ log(e_pop), data = vdem)

model2 <- lm(v2x_polyarchy ~ log(e_pop) + year, data = vdem)

# ns() with its default of one degree of freedom is just a linear basis,
# which is why models 2 and 3 come out identical below
model3 <- lm(v2x_polyarchy ~ log(e_pop) + splines::ns(year), data = vdem)

mlist <- list(model1, model2, model3)

modelsummary(mlist)
                  (1)        (2)        (3)
(Intercept)       -0.116     -3.979     -0.077
                  (0.013)    (0.102)    (0.011)
log(e_pop)        0.068      0.011      0.011
                  (0.002)    (0.002)    (0.002)
year                         0.002
                             (0.000)
ns(year)                                0.639
                                        (0.017)
Num.Obs.          3879       3879       3879
R2                0.196      0.415      0.415
R2 Adj.           0.196      0.415      0.415
AIC               -1607.0    -2838.9    -2838.9
BIC               -1588.2    -2813.8    -2813.8
Log.Lik.          806.514    1423.428   1423.428
RMSE              0.20       0.17       0.17

Potential fixes: fixed effects

If we have panel data (multiple groups observed over time), we can include year as a fixed effect. This controls for anything that varies over time in the same way for all groups.

fmodel1<-feols(v2x_polyarchy ~ log(e_pop) | year, data=vdem)
modelsummary(fmodel1)
                  (1)
log(e_pop)        0.003
                  (0.001)
Num.Obs.          3879
R2                0.568
R2 Adj.           0.540
R2 Within         0.001
R2 Within Adj.    0.000
AIC               -3557.6
BIC               -2104.5
RMSE              0.14
Std.Errors        by: year
FE: year          X

Potential fixes: fixed effects

Since we have both time periods AND groups, we can also add country fixed effects:

fmodel1<-feols(v2x_polyarchy ~ log(e_pop) | year, data=vdem)
fmodel2<-feols(v2x_polyarchy ~ log(e_pop) | country_name + year, data=vdem)

mlist<-list(fmodel1, fmodel2)

modelsummary(mlist)
                  (1)        (2)
log(e_pop)        0.003      0.101
                  (0.001)    (0.037)
Num.Obs.          3879       3879
R2                0.568      0.739
R2 Adj.           0.540      0.721
R2 Within         0.001      0.054
R2 Within Adj.    0.000      0.053
AIC               -3557.6    -5470.2
BIC               -2104.5    -3898.1
RMSE              0.14       0.11
Std.Errors        by: year   by: country_name
FE: year          X          X
FE: country_name             X

Potential fixes: differencing

  • If the dependence runs only one period back (to t-1), taking the first difference should address it

  • difference = current_value - prior_value
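In base R this is just diff(); the toy vector below is made up:

```r
# First differences in base R: each element is current_value - prior_value.
x <- c(10, 12, 11, 15)
diff(x)                  # 2 -1 4
# equivalent, keeping an NA for the first observation (like dplyr::lag):
x - c(NA, head(x, -1))   # NA 2 -1 4
```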

Differences

vdem|>
  filter(country_name=='Paraguay')|>
  mutate(polyarchy = v2x_polyarchy - dplyr::lag(v2x_polyarchy),
         group='differenced'
         )|>
  bind_rows(vdem|>filter(country_name=="Paraguay")|>
              mutate(group='original',
            polyarchy = v2x_polyarchy)
            )|>
ggplot(aes(x=year, y=polyarchy, color=group)) +
  geom_line()  +
  facet_wrap(~group) +
  theme_minimal() +
  ylab('Paraguay democracy score')

Differences

Taking the difference of both polyarchy and GDP in this model should eliminate the time trend effect. But it also alters what we’re actually regressing.

differenced_model<-vdem|>
  filter(country_name=='Paraguay')|>
  mutate(diff_polyarchy = v2x_polyarchy - dplyr::lag(v2x_polyarchy),
         diff_gdp = log(e_gdppc) - log(dplyr::lag(e_gdppc))
         
         )|>
  filter(year>=1995)|>
  lm(diff_polyarchy ~ diff_gdp, data=_)

huxreg("original_model"= model, "differenced_model" =differenced_model)
                  original_model   differenced_model
(Intercept)       0.278 **         0.003
                  (0.086)          (0.003)
log(e_gdppc)      0.112 **
                  (0.032)
diff_gdp                           0.002
                                   (0.115)
N                 25               25
R2                0.347            0.000
logLik            53.262           75.351
AIC               -100.524         -144.701
*** p < 0.001; ** p < 0.01; * p < 0.05.

Differences: change vs. level

In the differenced model, the coefficient relates a change in GDP to a change in the democracy score, not the level of GDP to the level of democracy.

huxreg("differenced_model" =differenced_model)
                  differenced_model
(Intercept)       0.003
                  (0.003)
diff_gdp          0.002
                  (0.115)
N                 25
R2                0.000
logLik            75.351
AIC               -144.701
*** p < 0.001; ** p < 0.01; * p < 0.05.

Differencing pros and cons

  • Pro: can account for the primary time series problem without adding any new parameters

  • Con: is actually a different model! Estimates changes, not levels!

  • Con: transforms the data in a way that makes it difficult to get predictions
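One workaround for the prediction problem is to cumulate predicted differences back onto the last observed level; the numbers below are hypothetical:

```r
# Turning predicted differences back into levels (hypothetical values).
last_level <- 0.60                    # assumed last observed polyarchy score
pred_diffs <- c(0.01, -0.02, 0.005)   # hypothetical predicted changes
last_level + cumsum(pred_diffs)       # implied levels: 0.61 0.59 0.595
```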

Potential fixes: lagged variable model

Another option is to include the lagged dependent variable itself as a covariate. This is typically referred to as an autoregressive model.

ar_model<-vdem|>
  filter(country_name=='Paraguay')|>
  arrange(year)|>
  mutate(lag_polyarchy = dplyr::lag(v2x_polyarchy),
         lag_gdp = dplyr::lag(e_gdppc)
         
         )|>
  filter(year>=1995)|>
  lm(v2x_polyarchy ~ lag_polyarchy  + log(e_gdppc)  + log(lag_gdp) , data=_)

AR model

huxreg('autoregressive' = ar_model)
                  autoregressive
(Intercept)       0.110 *
                  (0.046)
lag_polyarchy     0.896 ***
                  (0.099)
log(e_gdppc)      0.169
                  (0.140)
log(lag_gdp)      -0.188
                  (0.139)
N                 25
R2                0.913
logLik            78.417
AIC               -146.833
*** p < 0.001; ** p < 0.01; * p < 0.05.

AR model: multiple lags

One advantage of this approach is that you can simultaneously test multiple lags, which helps if you think there’s more than one source of trend.

ar_model2 <- vdem |>
  filter(country_name == 'Paraguay') |>
  arrange(year) |>   # make sure lags refer to the previous year
  mutate(lag_polyarchy = dplyr::lag(v2x_polyarchy),
         lag_gdp = dplyr::lag(e_gdppc),
         lag_gdp2 = dplyr::lag(e_gdppc, 2),
         lag_polyarchy2 = dplyr::lag(v2x_polyarchy, 2)) |>
  filter(year >= 1995) |>
  lm(v2x_polyarchy ~ lag_polyarchy + lag_polyarchy2 +
       log(e_gdppc) +
       log(lag_gdp) +
       log(lag_gdp2), data = _)

huxreg(ar_model2)
                  (1)
(Intercept)       0.129 *
                  (0.047)
lag_polyarchy     1.192 ***
                  (0.199)
lag_polyarchy2    -0.387
                  (0.208)
log(e_gdppc)      0.112
                  (0.149)
log(lag_gdp)      0.096
                  (0.260)
log(lag_gdp2)     -0.216
                  (0.157)
N                 25
R2                0.936
logLik            82.297
AIC               -150.593
*** p < 0.001; ** p < 0.01; * p < 0.05.

AR model: pros and cons

  • Pro: allows a much more complex model specification and also tests for autocorrelation

  • Con: interpretation is weird, and it adds a lot of parameters and inevitable multicollinearity

Other fixes

  • Including a fixed effect for time can work if you have multiple observations for different periods or if you only suspect a cyclical trend

  • More complicated autocorrelation structures can be modeled with a regression, and then OLS can be used on the residuals from that model

ACF plots

ACF plots can be used to visualize autocorrelations. If a lagged correlation extends beyond the blue dashed confidence bands, there’s statistically meaningful autocorrelation at that lag.

vdem |>
  filter(country_name == 'Paraguay') |>
  arrange(year) |>
  filter(year > 1990) |>
  mutate(diff_gdp = log(e_gdppc) - log(dplyr::lag(e_gdppc))) |>
  filter(!is.na(diff_gdp)) |>
  with(acf(diff_gdp, lag.max = 5, plot = TRUE))

vdem |>
  filter(country_name == 'Paraguay') |>
  arrange(year) |>
  filter(year > 1990) |>
  mutate(log_gdp = log(e_gdppc)) |>
  filter(!is.na(e_gdppc)) |>
  with(acf(log_gdp, lag.max = 5, plot = TRUE))